I am incredibly excited that RStudio has begun an instructor certification program based on the Carpentries, so of course I signed up as soon as my overcommitted nature allowed! This also gives me the excuse and motivation to finally work my way through R for Data Science (R4DS) properly. It is a book I read while waiting for GTT tests during my pregnancy, and one I have google-landed upon umpteen times while debugging code, but I have never taken the time to sit down and do the exercises. The pedagogue in me knows quite well that THAT is how you actually learn and internalise the principles and concepts in any material, especially if it deals with programming and analysis. So over the next few weeks I plan to work my way through R4DS, and this post is the first in which I dive into the exercises.

Notes on section I: Explore

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# devtools::install_github("thomasp85/patchwork")
library(patchwork)

Steps of the data pipeline:

  • Import: take data stored in a file, database, or web API, and load it into a data frame in R.

Wrangling:

  • Tidying - storing data in a consistent form that matches the semantics of the dataset with the way it is stored. In brief, when your data is tidy, each column is a variable, and each row is an observation.

  • Transformation
    • narrowing in on observations of interest (like all people in one city, or all data from the last year),
    • creating new variables that are functions of existing variables (like computing speed from distance and time),
    • calculating a set of summary statistics (like counts or means).

Together, tidying and transforming are called wrangling.
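
The three kinds of transformation listed above map fairly directly onto dplyr verbs. A minimal sketch using the mpg dataset that ships with ggplot2; the audi subset and the hwy_cty_ratio column are made up purely for illustration:

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

# narrow in on observations of interest
audi <- mpg %>% filter(manufacturer == "audi")

# create a new variable that is a function of existing variables
# (hwy_cty_ratio is a hypothetical column name)
mpg_ratio <- mpg %>% mutate(hwy_cty_ratio = hwy / cty)

# calculate a set of summary statistics per group
by_class <- mpg %>%
  group_by(class) %>%
  summarise(n = n(), mean_hwy = mean(hwy))

by_class
```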

Small data vs big data

Is big data really big? Two ways of thinking small about big data

Sampling

Sampling may be enough to answer the question.
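
As a rough sketch of the sampling idea, here diamonds stands in for a much larger table, and the sample size of 1000 is an arbitrary choice of mine:

```r
library(ggplot2)  # only needed here for the diamonds dataset

set.seed(42)  # make the random sample reproducible

# draw 1000 row indices without replacement and keep only those rows
rows <- sample(nrow(diamonds), size = 1000)
diamonds_small <- diamonds[rows, ]

# compare a summary on the sample with the same summary on the full data
mean(diamonds_small$price)
mean(diamonds$price)
```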

Your big data problem is actually a large number of small data problems

  • Each individual problem might fit in memory, but you have millions of them. For example, you might want to fit a model to each person in your dataset. That would be trivial if you had just 10 or 100 people, but instead you have a million.
  • So you need a system (like Hadoop or Spark) that allows you to send different datasets to different computers for processing.
  • Once you’ve figured out how to answer the question for a single subset using the tools described in this book, you can use tools like sparklyr, rhipe, and ddr to solve it for the full dataset.
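
On a single machine, the "many small problems" pattern can be sketched with split() and purrr::map(). This only illustrates fitting one model per subset; it is not how sparklyr or the other tools distribute the work, and the model hwy ~ displ per drv is just an example I picked:

```r
library(ggplot2)  # only needed here for the mpg dataset
library(purrr)

# one small data frame per drive train
by_drv <- split(mpg, mpg$drv)

# fit the same model to each subset
models <- map(by_drv, ~ lm(hwy ~ displ, data = .x))

# pull out the slope from each fitted model
slopes <- map_dbl(models, ~ coef(.x)[["displ"]])
slopes
```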

New (to me) ggplot() aesthetics

  • stroke - is either the size of the point (for a default geom_point()) OR, if used with shape 21-25, which have both a colour and a fill, is the thickness of the stroke around the plotted shape.

  • You can generally use geoms and stats interchangeably! For example, you can use stat_count() instead of geom_bar() to make the same plot!
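
A quick way to convince yourself of the geom/stat equivalence is to build both versions and compare the computed layer data with layer_data():

```r
library(ggplot2)

p_geom <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
p_stat <- ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

# both layers compute the same count for each cut
all.equal(layer_data(p_geom)$count, layer_data(p_stat)$count)
```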

ggplot(data = diamonds) + geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))

# not really new, but I'm sure I'll forget position = "fill"
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

# pie chart from bar
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = 1, fill = clarity)) + coord_polar(theta = "y")

On average, humans are best able to perceive differences in angles relative to 45 degrees. The function ggthemes::bank_slopes() will calculate the optimal aspect ratio to bank slopes to 45-degrees.

Very clear table of ggplot mappings (from here)

geom                default stat        shared docs
geom_abline()
geom_hline()
geom_vline()
geom_bar()          stat_count()        x
geom_col()
geom_bin2d()        stat_bin_2d()       x
geom_blank()
geom_boxplot()      stat_boxplot()      x
geom_contour()      stat_contour()      x
geom_count()        stat_sum()          x
geom_density()      stat_density()      x
geom_density_2d()   stat_density_2d()   x
geom_dotplot()
geom_errorbarh()
geom_hex()          stat_bin_hex()      x
geom_freqpoly()     stat_bin()          x
geom_histogram()    stat_bin()          x
geom_crossbar()
geom_errorbar()
geom_linerange()
geom_pointrange()
geom_map()
geom_point()
geom_path()
geom_line()
geom_step()
geom_polygon()
geom_qq_line()      stat_qq_line()      x
geom_qq()           stat_qq()           x
geom_quantile()     stat_quantile()     x
geom_ribbon()
geom_area()
geom_rug()
geom_smooth()       stat_smooth()       x
geom_spoke()
geom_label()
geom_text()
geom_raster()
geom_rect()
geom_tile()
geom_violin()       stat_ydensity()     x
geom_sf()           stat_sf()           x

stat                default geom        shared docs
stat_ecdf()         geom_step()
stat_ellipse()      geom_path()
stat_function()     geom_path()
stat_identity()     geom_point()
stat_summary_2d()   geom_tile()
stat_summary_hex()  geom_hex()
stat_summary_bin()  geom_pointrange()
stat_summary()      geom_pointrange()
stat_unique()       geom_point()
stat_count()        geom_bar()          x
stat_bin_2d()       geom_tile()         x
stat_boxplot()      geom_boxplot()      x
stat_contour()      geom_contour()      x
stat_sum()          geom_point()        x
stat_density()      geom_area()         x
stat_density_2d()   geom_density_2d()   x
stat_bin_hex()      geom_hex()          x
stat_bin()          geom_bar()          x
stat_qq_line()      geom_path()         x
stat_qq()           geom_point()        x
stat_quantile()     geom_quantile()     x
stat_smooth()       geom_smooth()       x
stat_ydensity()     geom_violin()       x
stat_sf()           geom_rect()         x

Data Viz Exercises

Exercises 3.2.4

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

An empty plot: ggplot() only sets up a blank canvas, and nothing is drawn on it because we haven’t added a geom layer.

  2. How many rows are in mpg? How many columns?
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
dim(mpg)
## [1] 234  11
  3. What does the drv variable describe? Read the help for ?mpg to find out.
?mpg

drv is the type of drive train: f = front-wheel drive, r = rear-wheel drive, 4 = four-wheel drive.

  4. Make a scatterplot of hwy vs cyl.

# set the default theme for all subsequent ggplot figures
theme_set(theme_classic())
mpg %>%
  ggplot(aes(x = cyl, y = hwy)) + geom_point()

# use transparency and jitter to make the points separate better
mpg %>%
  ggplot(aes(x = cyl, y = hwy)) + geom_jitter(width = 0.4, alpha = 0.5)

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
theme_set(theme_minimal())
mpg %>%
  ggplot(aes(x = class, y = drv)) + geom_point()

Because it is plotting a category vs a category, so many points land on top of each other and most of the space in the plot cannot be filled. However, I’d argue that it’s not completely useless, as it does show that all 2 seater cars have rear-wheel drive, while all minivans have front-wheel drive.

The barplot below probably presents a better visualisation, as it also shows that we may not have sampled enough 2 seater vehicles to identify whether any of them could possibly have front-wheel drive. Having said that, I do like the dot-plot visualisation as well.

theme_set(theme_minimal())
mpg %>%
  ggplot(aes(x = class, fill = drv)) + geom_bar()


Exercises 3.3.1

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

Because color has been set inside aes(), ggplot treats "blue" as a data value: it maps the constant string to the colour scale (which picks its own default colour) instead of setting the colour of the points to blue. To fix, move color outside aes():

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "…

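We can use the glimpse() command (or simply print mpg), which shows the type of each variable. Those that are ‘chr’ (character) are categorical, whereas those that are ‘int’ (integer) or ‘dbl’ (double) are numeric. Note that some integer variables, such as cyl and year, take only a handful of values and can sensibly be treated as categorical.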
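
The same information can be pulled out programmatically. A small sketch using purrr (loaded with the tidyverse) to sort the columns by type; the variable names col_types, categorical and numeric_cols are my own:

```r
library(ggplot2)  # only needed here for the mpg dataset
library(purrr)

# type of each column, matching what glimpse() reports
col_types <- map_chr(mpg, ~ class(.x)[1])

# character columns are the categorical ones
categorical <- names(col_types)[col_types == "character"]

# integer and double ("numeric") columns are the numeric ones
numeric_cols <- names(col_types)[col_types %in% c("integer", "numeric")]

categorical
numeric_cols
```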

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
mpg %>% ggplot(aes(x = displ, y = hwy, col = hwy)) + geom_point() 

For a continuous variable, colour is used to represent a gradient.

mpg %>% ggplot(aes(x = displ, y = hwy, size = hwy)) + geom_point() 

Size becomes bigger as the values get bigger.

# mpg %>% ggplot(aes(x = displ, y = hwy, shape = hwy)) + geom_point() 

Shape gives an error, because a continuous variable cannot be mapped to shape.

mpg %>% ggplot(aes(x = displ, y = hwy, col = manufacturer)) + geom_point() 

Colour colours the points by the levels of the category.

mpg %>% ggplot(aes(x = displ, y = hwy, size = manufacturer)) + geom_point() 
## Warning: Using size for a discrete variable is not advised.

Size is not advised, but still works.

# mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point()
# By default only the first 6 manufacturers are drawn (with a warning), since the default shape palette has 6 shapes
mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point() + scale_shape_manual(values = seq_along(unique(mpg$manufacturer)))

Shape by default only handles 6 categories, and the remaining ones are dropped with a warning. It can be coerced into presenting more than 6 shapes by using scale_shape_manual().

  4. What happens if you map the same variable to multiple aesthetics?

It gets mapped to each of them! The information is redundant, and ggplot combines the aesthetics into a single legend.

mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer, col = manufacturer)) + geom_point() + scale_shape_manual(values = seq_along(unique(mpg$manufacturer)))

  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
mpg %>% ggplot(aes(x = displ, y = hwy, stroke = displ)) + geom_point()

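Increases the thickness of the stroke (the border of the point) as values of the variable get larger. The border is easiest to see with shapes 21-25, which have both a colour (border) and a fill.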
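
To see the border on its own, here is a small sketch that sets stroke as a fixed value on a filled shape. The particular values (shape 21, size 3, stroke 2, black on white) are arbitrary choices for illustration:

```r
library(ggplot2)

# shape 21 is a circle with a separate border (colour) and interior (fill);
# stroke controls the width of the border
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, colour = "black", fill = "white", size = 3, stroke = 2)
p
```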

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
mpg %>% ggplot(aes(x = displ, y = hwy, colour = displ < 5)) + geom_point()

The expression is evaluated for every row, and the resulting logical variable (displ < 5, either TRUE or FALSE) is mapped to colour.


Exercises 3.5.1 Facets

  1. What happens if you facet on a continuous variable?
mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + facet_grid(~hwy)

It treats it as categorical and draws one panel for every unique value, so bad things!

  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))

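The empty cells are combinations of drv and cyl with no cars: drv == "r" with cyl == 4, and cyl == 5 with drv == "4" or drv == "r". They correspond to the positions with no points in this plot. (There are also no cars with cyl == 7, but since 7 never occurs in the data it does not get a facet at all.)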
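
The empty combinations can also be listed directly rather than read off the plot. A sketch using dplyr::count() and tidyr::complete(); empty_cells is just a name I chose:

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # only needed here for the mpg dataset

empty_cells <- mpg %>%
  count(drv, cyl) %>%                          # observed combinations
  complete(drv, cyl, fill = list(n = 0L)) %>%  # add the unobserved ones with n = 0
  filter(n == 0)

empty_cells
```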

  3. What plots does the following code make? What does . do?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .)

The code makes one row of panels per value of drv. The . says not to facet on that dimension (here, the columns).

  4. Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

    ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)

Cleaner to see the trend in each level of class. The disadvantage is that it is harder to compare values across panels than when all of the points share one plot. If we had a larger dataset, faceting would be more important, as overlaying all of the points would create a data blob instead of a meaningful visualisation.

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow and ncol specify how many rows and columns we want our panels to be split into. facet_grid() doesn’t need them, as the number of rows and columns is determined by the number of unique values of the variables we’re faceting by.

scales is very useful as it allows us to have free scales (i.e. different scales) for each of our individual plots.
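
As a sketch of the scales option, here is the faceted plot from the previous exercise with a free y axis. I picked "free_y" for illustration; "free_x" and "free" are the other non-default values:

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2, scales = "free_y")  # each panel gets its own y axis
p
```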

  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Because plots are usually wider than they are tall, so there is more room for many panels side by side, which allows us to better see the spread of the data.

Exercises 3.6 Geometric objects

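  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

geom_line(), geom_boxplot(), geom_histogram() and geom_area(), respectively.

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE)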
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

Hides the legend for that geom layer. Note that if you want to hide the legend completely, you need to set it in each geom layer we present, so geom_point(show.legend = FALSE) and geom_smooth(show.legend = FALSE) for the plot above.

  4. What does the se argument to geom_smooth() do?

Specifies whether to draw the confidence interval (the shaded band based on the standard error) around the smoothed line.

  5. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point() + 
      geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
      geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

No, they should be identical, because they specify the same data and x/y aesthetics for both geoms. The first version just sets them once in ggplot(), and both layers inherit them.

  6. Recreate the R code necessary to generate the following graphs.

I code the six plots to variables first, and then use the patchwork library to present them in one figure below:

one <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(se = FALSE)
# group = drv draws one smooth per drive train without adding a legend
two <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(aes(group = drv), se = FALSE)
three <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv)) + geom_point() + geom_smooth(se = FALSE)
four <- mpg %>% ggplot() + geom_point(aes(x = displ, y = hwy, col = drv)) + geom_smooth(aes(x = displ, y = hwy), se = FALSE)
five <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv, linetype = drv)) + geom_point() + geom_smooth(se = FALSE)
# shape 21 has a fill and a border, so a white border gives the "halo" effect
six <- mpg %>% ggplot(aes(x = displ, y = hwy, fill = drv)) + geom_point(shape = 21, col = "white", stroke = 2, size = 3) + theme_gray()

one + two + three + four + five + six + plot_layout(ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercises 3.7 Statistical transformations

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

# original
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

# modified (note: from ggplot2 3.3.0 onwards these arguments are called fun.min, fun.max and fun)
ggplot(data = diamonds, aes(x = cut, y = depth)) + 
  geom_pointrange(stat = "summary",
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median)

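  2. What does geom_col() do? How is it different to geom_bar()?

It draws bars whose heights are the values in the data, so it is the equivalent of geom_bar(stat = "identity"). geom_bar() by default counts the number of rows at each x instead.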
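
A small sketch of the difference: geom_bar() on the raw data and geom_col() on pre-computed counts should give the same bar heights. cut_counts is a helper table I build here with dplyr::count():

```r
library(dplyr)
library(ggplot2)

# geom_bar() counts the rows for us
p_bar <- ggplot(diamonds) + geom_bar(aes(x = cut))

# geom_col() expects the bar heights to already be in the data
cut_counts <- count(diamonds, cut)
p_col <- ggplot(cut_counts) + geom_col(aes(x = cut, y = n))

# the bar heights agree
all.equal(as.numeric(layer_data(p_bar)$y), as.numeric(layer_data(p_col)$y))
```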

  3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?

geom                stat
geom_bar()          stat_count()
geom_bin2d()        stat_bin_2d()
geom_boxplot()      stat_boxplot()
geom_contour()      stat_contour()
geom_count()        stat_sum()
geom_density()      stat_density()
geom_density_2d()   stat_density_2d()
geom_hex()          stat_bin_hex()
geom_freqpoly()     stat_bin()
geom_histogram()    stat_bin()
geom_qq_line()      stat_qq_line()
geom_qq()           stat_qq()
geom_quantile()     stat_quantile()
geom_smooth()       stat_smooth()
geom_violin()       stat_ydensity()
geom_sf()           stat_sf()

Many (but not all) have similar names.
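
Rather than trawling the documentation, the default stat can also be read off a layer object. This is a sketch that leans on the internal structure of ggplot2 layers (the stat field), which I am assuming stays stable rather than treating it as a documented interface:

```r
library(ggplot2)

# a layer created by a geom function carries its default stat with it;
# the first class of that stat object names it
class(geom_bar()$stat)[1]
class(geom_violin()$stat)[1]
class(geom_smooth()$stat)[1]
```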

  4. What variables does stat_smooth() compute? What parameters control its behaviour?
  5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., fill = color, group = 1))

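By default geom_bar() treats each value of x as its own group, so the proportions are calculated within each group and every bar is presented as 100%. Setting group = 1 makes the proportions relative to the whole dataset, but it also overrides the grouping by fill, so the fill colours are lost. To get the “best” visualisation: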

ggplot(data = diamonds) +
  geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))
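
An alternative that avoids the computed-variable syntax altogether is to work out the proportions with dplyr first and plot them with geom_col(). A sketch, where props is my own helper table:

```r
library(dplyr)
library(ggplot2)

props <- diamonds %>%
  count(cut, color) %>%
  mutate(prop = n / sum(n))  # proportion of the whole dataset, not of each bar

p <- ggplot(props) +
  geom_col(aes(x = cut, y = prop, fill = color))
p
```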

Exercises 3.8 Position adjustments

  1. What is the problem with this plot? How could you improve it?

    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()

ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_jitter(height = 1, width = 1, alpha = 0.6)

The points overlap, because cty and hwy are rounded to whole numbers and many cars share the same values. To address: use jitter and alpha.

  2. What parameters to geom_jitter() control the amount of jittering?

width and height, which set the amount of horizontal and vertical jitter.

  3. Compare and contrast geom_jitter() with geom_count()
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
  geom_count()

geom_count() plots the number of observations at each position as the size of a single point, instead of moving the points as geom_jitter() does.
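
To get a feel for how much overplotting there is, the sizes that geom_count() draws can be approximated by counting the cars at each position. A sketch using dplyr::count(); overlaps is my own name:

```r
library(dplyr)
library(ggplot2)  # only needed here for the mpg dataset

# how many observations share each (cty, hwy) position?
overlaps <- count(mpg, cty, hwy, sort = TRUE)

nrow(mpg)       # number of cars
nrow(overlaps)  # number of distinct positions on the plot
head(overlaps)  # the most crowded positions
```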

  4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
mpg %>% ggplot(aes(x = as.factor(cyl), y = hwy, colour = fl)) + geom_boxplot()
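
The default position for geom_boxplot() is "dodge2", which is why the boxes for each fuel type sit side by side in the plot above. A sketch comparing it with position = "identity", where the boxes are drawn on top of each other instead:

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = as.factor(cyl), y = hwy, colour = fl))

p_default  <- base + geom_boxplot()                       # default position
p_dodge2   <- base + geom_boxplot(position = "dodge2")    # the same thing, spelled out
p_identity <- base + geom_boxplot(position = "identity")  # boxes overlap
```

Printing p_default and p_dodge2 should give the same figure, while p_identity shows what the adjustment is saving us from.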

Exercises 3.9 Coordinate systems

  1. Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = 1, fill = clarity)) + coord_polar(theta = "y")

  2. What does labs() do? Read the documentation.

Specify labels! x, y axes, title etc!
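
A short sketch of labs() in use. The label text is my own wording, not taken from the book:

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) +
  geom_point() +
  labs(
    title = "Highway mileage against engine size",
    x = "Engine displacement (litres)",
    y = "Highway miles per gallon",
    colour = "Drive train"   # legend titles are set via the aesthetic name
  )
p
```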

  3. What’s the difference between coord_quickmap() and coord_map()?
  4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()